docs: parquet-format + socket-analysis guides (re-target to main) by randomizedcoder · Pull Request #41 · randomizedcoder/xtcp2

randomizedcoder · 2026-06-20T03:28:33Z

Adds docs/parquet-format.md — a consumer-facing guide to the S3/Parquet export, written for an enterprise data/analytics team that has only a basic grasp of TCP.

Stacked on #39 (docs/protobuf-formats) so the cross-link to that doc resolves. Merge #39 first; this then retargets to main.

What it covers

File layout — Hive partitioning host=/date=/hour= (UTC), object naming, how engines expose the partitions.
Size / cadence / compression — ~63 MiB uncompressed soft cap (-s3ParquetFlushBytes), per-column ZSTD (strings/bytes) + SNAPPY (numerics).
Reading it — DuckDB, pandas/pyarrow, Trino/Athena snippets with partition pruning and column projection.
The grain — one row per socket per poll; counters are cumulative; track a connection via inet_diag_msg_socket_cookie.
"Start here" columns — the high-value subset with units (RTT µs, snd_cwnd packets, delivery_rate bytes/s, total_retrans, byte counters, congestion algorithm) so the team knows where to focus first.
Decoding cheat sheet — raw-byte IPs via inet_diag_msg_family, the TCP state integer→name map, congestion enum, timestamp_ns.
Full schema grouping + types, proto3 no-NULL/zero-default gotchas, and where the schema is defined (ParquetRow + the drift test that keeps Parquet/proto/ClickHouse in lockstep).
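As a rough illustration of the decoding cheat sheet, the sketch below turns a raw-byte address and a state integer into readable values. It assumes the address columns hold network-order bytes (IPv4 possibly padded to 16 bytes, as in the kernel's inet_diag structs) and that the state column uses the Linux kernel's TCP state numbering; `decode_ip` and `decode_state` are hypothetical helper names, and the map in docs/parquet-format.md remains the authoritative one.

```python
import ipaddress

# Linux address-family constants (AF_INET = 2, AF_INET6 = 10).
AF_INET, AF_INET6 = 2, 10

# Linux kernel TCP state numbers (include/net/tcp_states.h).
TCP_STATES = {
    1: "ESTABLISHED", 2: "SYN_SENT", 3: "SYN_RECV", 4: "FIN_WAIT1",
    5: "FIN_WAIT2", 6: "TIME_WAIT", 7: "CLOSE", 8: "CLOSE_WAIT",
    9: "LAST_ACK", 10: "LISTEN", 11: "CLOSING", 12: "NEW_SYN_RECV",
}


def decode_ip(raw: bytes, family: int) -> str:
    """Turn a raw address column into a printable IP, keyed on the family."""
    if family == AF_INET:
        # Assumption: IPv4 may sit in a 16-byte field; only the first 4 bytes count.
        return str(ipaddress.IPv4Address(raw[:4]))
    if family == AF_INET6:
        return str(ipaddress.IPv6Address(raw[:16]))
    raise ValueError(f"unexpected address family: {family}")


def decode_state(state: int) -> str:
    """Map the TCP state integer to its name, keeping unknown values visible."""
    return TCP_STATES.get(state, f"UNKNOWN({state})")
```

Branching on inet_diag_msg_family rather than on the byte length keeps the decode correct even if IPv4 addresses are stored zero-padded.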

Cross-linked from the docs hub and output-and-destinations.md (S3 section).

Notes

Grounded in the actual code: ParquetRow (destinations_s3parquet_schema.go), the objectKey layout, and the 63 MiB flush cap.
Verified: all relative links (../pkg/…, ../proto/…, sibling docs) resolve; no broken intra-doc anchors.

🤖 Generated with Claude Code

New docs/parquet-format.md explains the S3/Parquet export for an enterprise data/analytics audience consuming xtcp2's TCP telemetry:

- Hive partition layout (host=/date=/hour=, UTC) and object naming
- file size/cadence (~63 MiB uncompressed soft cap) and per-column compression (ZSTD strings/bytes, SNAPPY numerics)
- how to read it (DuckDB/pandas/Trino) with partition pruning
- the grain (one row per socket per poll; cumulative counters; socket cookie)
- a 'start here' set of the key TCP columns with units (rtt µs, cwnd packets, delivery_rate bytes/s, total_retrans, byte counters, congestion algo)
- decoding cheat sheet (raw-byte IPs via family, TCP state map, enums, ts)
- full schema grouping + types, proto3 no-null gotchas, and where the schema is defined (ParquetRow + drift test).

Cross-linked from the docs hub and output-and-destinations (S3 section).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
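Because the grain is one row per socket per poll and the counters are cumulative, per-interval figures come from differencing consecutive polls of the same socket. The following is a minimal sketch of that idea over plain dict rows. The column names (host, inet_diag_msg_socket_cookie, timestamp_ns, total_retrans) come from the description above; the function name `per_poll_deltas`, the row shape, and the choice to key on (host, cookie), on the assumption that cookies are only unique within a host, are illustrative.

```python
from collections import defaultdict


def per_poll_deltas(rows, counter="total_retrans"):
    """Return ((host, cookie), timestamp_ns, delta) for consecutive polls of a socket."""
    by_socket = defaultdict(list)
    for row in rows:
        # Assumption: a socket cookie is only unique per host, so key on both.
        key = (row["host"], row["inet_diag_msg_socket_cookie"])
        by_socket[key].append(row)

    deltas = []
    for key, polls in by_socket.items():
        polls.sort(key=lambda r: r["timestamp_ns"])
        # A socket seen in only one poll yields no delta.
        for prev, cur in zip(polls, polls[1:]):
            deltas.append((key, cur["timestamp_ns"], cur[counter] - prev[counter]))
    return deltas
```

Summing the raw counter across rows would count the same retransmits once per poll, which is why the difference is taken first.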

Short, name-matched COPY INTO recipe (file format + stage + INFER_SCHEMA auto-create + MATCH_BY_COLUMN_NAME), an external-table/AUTO_REFRESH note for continuous ingest, and the two Snowflake gotchas: path-based Hive partitions (derive from metadata$filename) and BINARY address columns.
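The path-based partition gotcha applies to any reader that does not understand Hive layouts: host, date, and hour live only in the object key, not inside the file. As a sketch of the same derivation outside Snowflake, the helper below pulls the three values from a key. The partition names come from the layout described above; the example key, the value formats, and the name `hive_partitions` are hypothetical.

```python
import re

# Matches "host=<v>", "date=<v>" or "hour=<v>" as a whole path segment.
_PARTITION_RE = re.compile(r"(?:^|/)(host|date|hour)=([^/]+)")


def hive_partitions(object_key: str) -> dict:
    """Extract the host/date/hour partition values from a Hive-style object key."""
    parts = dict(_PARTITION_RE.findall(object_key))
    missing = {"host", "date", "hour"} - parts.keys()
    if missing:
        raise ValueError(f"missing partition keys: {sorted(missing)}")
    return parts


# Hypothetical key, for illustration only.
print(hive_partitions("host=web01/date=2026-06-20/hour=03/part-0001.parquet"))
```

The values come back as strings, so any casting (for example, hour to an integer) is left to the caller.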

New docs/socket-analysis.md: a methodology guide for finding the natural RTT bands statistically (min_rtt on a log scale; GMM+BIC for adaptive, drift-aware bands; Jenks/KDE simple alternative; Snowflake quantile quick-win), with labeling/validation against dest ASN/geo and per-DC/over-time tracking.

Adds multi-feature clustering (HDBSCAN) and other groupings (throughput, loss, congestion algo, per-ASN, diurnal), a worked SQL→Python example, and a pitfalls section (per-socket grain, cumulative counters, µs units, survivorship, app-limited throughput, drift).

Cross-linked from the docs hub and parquet doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
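For a sense of the quantile quick-win, here is a standard-library sketch that derives band edges from min_rtt values on a log scale. It is not the guide's worked example: `rtt_band_edges` and `band_of` are hypothetical names, and dropping zero values assumes that a zero min_rtt means "unset" under the proto3 zero-default behaviour. Quantiles are rank-based, so the log transform only changes the interpolation between points here; it matters far more for the GMM or KDE approaches.

```python
import math
import statistics


def rtt_band_edges(min_rtt_us, n_bands=4):
    """Return n_bands - 1 band edges (in µs) from quantiles of log10(min_rtt)."""
    # Assumption: zero means "unset"; log10 is undefined there anyway.
    logs = [math.log10(v) for v in min_rtt_us if v > 0]
    cuts = statistics.quantiles(logs, n=n_bands)
    return [10 ** c for c in cuts]


def band_of(rtt_us, edges):
    """Index of the band an RTT falls into (0 = lowest RTT)."""
    return sum(rtt_us > e for e in edges)
```

Fixed quantile edges do not adapt to drift, which is the gap the GMM+BIC approach in the guide is meant to cover.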

docs: socket-analysis guide — RTT bands & clustering for data teams

randomizedcoder · 2026-06-20T20:58:41Z

Re-targeted to main. This branch already contains both docs/parquet-format.md and docs/socket-analysis.md (PR #42 was merged into this branch), plus the Snowflake section and exact column count. Merging this lands both docs on main; they had not reached main because #41/#42 were stacked on branches that merged first.

randomizedcoder and others added 5 commits June 19, 2026 20:28

docs: soft-wrap parquet-format.md (let the renderer wrap)

59f5649

docs(parquet): state the exact column count (122, not ~120)

bfe6c7a

randomizedcoder mentioned this pull request Jun 20, 2026

docs: socket-analysis guide — RTT bands & clustering for data teams #42

Merged

Merge pull request #42 from randomizedcoder/docs/socket-analysis

cf6267d

docs: socket-analysis guide — RTT bands & clustering for data teams

randomizedcoder changed the title ~~docs: Parquet format reference for data teams~~ docs: parquet-format + socket-analysis guides (re-target to main) Jun 20, 2026

randomizedcoder changed the base branch from docs/protobuf-formats to main June 20, 2026 20:58

randomizedcoder merged commit 69a94f2 into main Jun 20, 2026

randomizedcoder merged 6 commits into main from docs/parquet-format
